##Introduction

Interpretation of multi-omics data is very challenging. The IntLIM package aims to integrate multiple types of omics data. Unlike other approaches, IntLIM is focused on understanding how specific multi-omic associations are affected by phenotypic features. To this end, we develop a linear modeling approach that describes how multi-omic associations are affected by phenotype. More information can be found on our publication “IntLIM: integration using linear models of metabolomics and gene expression data”The workflow involves the following steps: 1) Input multi-omic data files. 2) Filter data sets by analyte abundance and imputed values. 3) Run the linear model to extract pairwise interaction significance. 4) Filter results by p-values and/or Spearman correlation differences, interaction coefficient percentiles, and r-squared values. 5) plot/visualize specific analyte associations.

##Installation

IntLIM is an R package and can be run on version >= 3.2.0.

Download (or upgrade) R here: https://cloud.r-project.org/

RStudio (an interface to R than can make R easier to use) can be download here (not required): https://www.rstudio.com/products/rstudio/download3/

The function install_github() from the “devtools” package (Wickham and Chang, 2015) installs IntLIM directly from GitHub. To install IntLIM, enter the following:

if(!require("devtools")){
  install.packages("devtools")
}
library("devtools")
install_github("ncats/IntLIM@liz_dev")

IntLIM can be loaded using the library function

library(IntLIM)

##Inputting Analyte Level Data

The first step is importing in the multi-omic data. To this end we have provided a sample data file for users wishing to use IntLIM.

IntLIM requires a specific format for multi-omic data. For IntLIM, we require the data corresponding to two analyte types, optional metadata for each of these types, and sample meta data. We also need a CSV meta-file that lists the location of the other files. These need to be in the same folder. The formats are described below.
In addition, we provide a sample set of files.

Please be sure that all files noted in the CSV file, including the CSV file, are in the same folder. Do not include path names in the filenames.

Users need to input a CSV file with two required columns: ‘type’ and ‘filenames’.

The CSV file is expected to have the following 2 columns and 6 rows:

  1. type,filenames
  2. analyteType1,myfilename (optional if analyteType2 is provided)
  3. analyteType2,myfilename (optional if analyteType1 is provided)
  4. analyteType1MetaData,myfilename (optional)
  5. analyteType2MetaData,myfilename (optional)
  6. sampleMetaData,myfilename"

The data and meta-data is stored in a series of comma-separated-values (.CSV) files.
The 5 files consist of data for two analyte types, metadata for two analyte types, and sample meta data. A meta-file lists the location of the other 5 files.
This meta-file is input into IntLIM.

Please be sure to normalize your data appropriately before inputting it into IntLIM.

Input data files should be in a specific format:

File type Description
analyteType1 rows are analytes from type 1 (e.g. metabolites), columns are samples
analytetype2 rows are analytes from type 2 (e.g. genes), columns are samples
analyteType1MetaData rows are analytes, features are columns
analyteType2MetaData rows are analytes, features are columns
sampleMetaData rows are samples, features are columns

For the analyte data files, the first row contains the feature IDs and the first column contains the sample IDs.

For the sampleMetaData, the first column of the sampleMetaData file is assumed to be the sample ID, and those sample IDs should match the first row of analyte data (e.g. it is required that all sample IDs in the analyte data are also in the sampleMetaDatafile).

Additionally, the analyte data files and SampleMetaData need to contain an ‘id’ column that contains the name of the features (analytes) or sample (sample id, name, etc).

A small data set is embedded in a package. This consists of the National Cancer Institute-60 (NCI-60) cell line data with a reduced number of genes for faster calculation. To access it use the following commands. csvfile describes the location of the meta-file describing the location of the other 5 input files.

dir <- system.file("extdata", package="IntLIM", mustWork=TRUE)
csvfile <- file.path(dir, "NCItestinput.csv")
csvfile
## [1] "H:/R/R-4.1.0/library/IntLIM/extdata/NCItestinput.csv"

Through the ReadData() function, users input the above .csv files containing analyte data and metadata. Note: It is assumed that all data has been pre-scaled and preprocessed The input data, sample data, and metadata are read into an IntLimData object, which will be used for all further analysis.

inputData <- IntLIM::ReadData(inputFile = csvfile,analyteType1id='id',analyteType2id='id',
                              class.feat = list(panelnbr = "factor",
                                                 cellnbr = "numeric",
                                                 PBO_vs_Leukemia = "factor",
                                                drugscore = "numeric"))
## [1] "IntLIMData created"

The ShowStats() function allows the user to summarize the data (how many total analytes of each types and samples in each data-set as well as samples in common).
We have a data-set that consists of 20 cell lines (samples), 1448 genes, and 257 metabolites.

IntLIM::ShowStats(IntLimObject = inputData)
##   Num_Analytes_Type1 Num_Analytes_Type2 Num_Samples
## 1                257               1448          20

##Filtering and Observing Data

Optionally, the FilterData() function allows the user to filter out the features (analytes) based on their mean values. Users should input a percentile cutoff and any feature with a mean value below that cutoff will be removed.
Furthermore we can filter out analytes by percentage of missing or imputed values.
For the analysis, we filter out the genes with the lowest 10% of gene expression and metabolites with more than 80% imputed values. This is done as below. We henceforth have 1303 genes and 212 metabolites for 20 cell line samples.
The “stype” variable is the phenotype or outcome of interest. This function checks that the “stype” values are the same for all input data and retains the phenotype data for later use. The optional “class.covar” variable coerces covariates to the given classes.

inputDatafilt <- IntLIM::FilterData(inputData,
                                    analyteType1perc = 0.10, analyteType2perc = 0.10,
                                    analyteType2miss = 0.80)

The PlotDistribution() function allows users to produce a boxplot of the distribution of analyte level data. This is done as below.

IntLIM::PlotDistributions(inputData = inputDatafilt)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Prior to running the model, the user can also perform a principal component analysis of the analyte data using the PlotPCA() function. The stype command allows the user to select a column from the sample meta data that color-codes the samples into two categories (two cancer types, tumor vs.  non-tumor, etc). For the sample set, we select PBO (prostate/breast/ovarian cancers) vs. the leukemia group.

IntLIM::PlotPCA(inputData = inputDatafilt,stype = "PBO_vs_Leukemia")

##Run IntLIM The linear models are run by the RunIntLim() function. The stype command allows the user to select a column from the sample meta data for the two categories to be compared (two cancer types, tumor vs. non-tumor, etc). Currently, IntLIM supports continuous data and comparison of two categories (“PBO_vs_Leukemia” is selected). The resulting object from the analysis is an IntLimResults object containing slotsfor un-adjusted and False Discovery Rate (FDR)-adjusted p-values for the “(g:p)” interaction coefficient. A significant FDR-adjusted p-value implies that the slope of gene-metabolite association in one phenotype is different from the other. The RunIntLim() function is based heavily on the MultiLinearModel functions developed for the ClassComparison package part of oompa (http://oompa.r-forge.r-project.org).

The following variations of this function are available: - Setting save.covar.pvals to TRUE returns the p-values and coefficients of all terms in the model (not only the interaction term p-value and coefficient). - Setting remove.duplicates to TRUE filters analyte association models for the same type of output and independent variable type to include only the model with the highest p-value. e.g. if m1 ~ m2 has a higher p-value than m2 ~ m1, then the m2 ~ m1 model will be returned. By default, only self-associations (i.e. m1 ~ m1) are removed.

myres.gm <- IntLIM::RunIntLim(inputData = inputDatafilt,stype="PBO_vs_Leukemia", 
                              covar = c("cellnbr"), outcome = 1, 
                              independent.var.type = 2,
                              save.covar.pvals = TRUE)
## [1] "Running the analysis on"
## 
## Leukemia      PBO 
##        6       14 
## [1] "10 % complete"
## [1] "20 % complete"
## [1] "30 % complete"
## [1] "40 % complete"
## [1] "50 % complete"
## [1] "60 % complete"
## [1] "70 % complete"
## [1] "80 % complete"
## [1] "90 % complete"
## [1] "100 % complete"
##    user  system elapsed 
##   10.92    0.12   12.13

The DistPValues() function allows the user to observe a histogram of the p-values. By default, FDR-adjusted p-values are shown. However, users can also choose to plot nominal p-values.

IntLIM::DistPvalues(IntLimResults = myres.gm)

IntLIM::DistPvalues(IntLimResults = myres.gm, adjusted = FALSE)

The DistRSquared() function allows the user to observe a histogram of the r-squared values.

IntLIM::DistRSquared(IntLimResults = myres.gm)

The PValueBoxPlots() function allows the user to observe the p-value distributions of all covariates in the linear models prior to adjustment. Note: To view these values, you must set save.covar.pvals to TRUE when running RunIntLim.

IntLIM::PValueBoxPlots(IntLimResults = myres.gm)

The InteractionCoefficientGraph() function allows the user to observe the interaction coefficients of all models, sorted in ascending order.

IntLIM::InteractionCoefficientGraph(inputResults = myres.gm, 
                                    interactionCoeffPercentile = 0.9)

The PlotPair() function allows the user to plot a chosen analyte association for selected groups. An example is shown below for DLG4 vs.  (p-Hydroxyphenyl)lactic acid.

Plot a pair of interest:

IntLIM::PlotPair(inputData = inputDatafilt,
                 inputResults = myres.gm,
                 outcome = 1, 
                 independentVariable = 2, 
                 outcomeAnalyteOfInterest = "(p-Hydroxyphenyl)lactic acid", 
                 independentAnalyteOfInterest = "DLG4")

The pvalCorrVolcano() function allows users to observe a volcano plot comparing the Spearman correlation difference between groups to the –log10(FDR-adjusted p-value). Note that these functions are not applicable to continuous outcome data.

IntLIM::pvalCorrVolcano(inputResults = myres.gm, inputData = inputDatafilt, 
                        diffcorr = 0.5, pvalcutoff = 0.05)
## Warning in IntLIM::ProcessResults(inputResults, inputData, diffcorr = 0, :
## The following independent variables have a standard deviation of 0 in one
## or more classes and will not be evaluated for differential correlation:
## arabinose, gamma.glu.cys, glyceric.acid, N.formyl.L.methionine, oxitryptan,
## porphobilinogen, X.1126, X.2193, X.2272, X.2768, X.2781, X.3022, X.3138, X.3363,
## X.3365, X.3394, X.3400, X.3489, X.3830, X.3832...possible.phenol.sulfate,
## X.3893, X.3955, X.4429, X.4503, X.4510, X.4514, X.4522, X.4906,
## X2..deoxyuridine.5..triphosphate, X5.6.dihydrouracil
## [1] "144633 pairs found given cutoffs"

Process the results and filter pairs of analytes based on adjusted p-values and differences in correlation coefficients between groups 1 and 2. Then plot heatmap of significant analyte pairs

##Filter Results

The ProcessResults() function allows the user to filter the results by FDR p-values (default set at 0.05) and by the absolute value difference of the analyte Spearman correlation (default set at 0.50) between the two groups. The output is a list of analyte pairs and analyte Spearman correlations for each of the two groups. Note that the diffcorr and corrtype arguments must be set to NA for continuous phenotypes.

myres.gm.sig <- IntLIM::ProcessResults(inputResults = myres.gm, 
                                       inputData = inputDatafilt, 
                                       pvalcutoff = 0.10, diffcorr = 0.5, 
                                       rsquaredCutoff = 0.2)
## Warning in IntLIM::ProcessResults(inputResults = myres.gm, inputData
## = inputDatafilt, : The following independent variables have a standard
## deviation of 0 in one or more classes and will not be evaluated for
## differential correlation: arabinose, gamma.glu.cys, glyceric.acid,
## N.formyl.L.methionine, oxitryptan, porphobilinogen, X.1126, X.2193, X.2272,
## X.2768, X.2781, X.3022, X.3138, X.3363, X.3365, X.3394, X.3400, X.3489, X.3830,
## X.3832...possible.phenol.sulfate, X.3893, X.3955, X.4429, X.4503, X.4510,
## X.4514, X.4522, X.4906, X2..deoxyuridine.5..triphosphate, X5.6.dihydrouracil
## [1] "1746 pairs found given cutoffs"

The CorrHeatmap() produces a clustered heatmap of the analyte Spearman correlations for each of these groups.
Note: This function is only applicable to discrete phenotypes.

IntLIM::CorrHeatmap(inputResults = myres.gm.sig)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

The HistogramPairs() function plots the count of analyte pairs by either outcome or independent variable analyte.

IntLIM::HistogramPairs(myres.gm.sig, type = "outcome")

IntLIM::HistogramPairs(myres.gm.sig, type = "independent")

Of note, plots are generated using Highcharter (http://jkunst.com/highcharter/) and Plotly (Sievert, et al.) (https://plot.ly), which enables interactive visualization, allowing users to promptly assess the effect of changing parameters on analysis results and accelerating discovery of phenotype-specific analyte pairs. This will greatly allow the workflow to be accessible to non-computational biologists.

The OutputData() and OutputResults() function allows the users to output the data and results of the analyses into zipped CSV files.

Lastly, various writing functions are implemented. The OutputData() and OutputResults() function allows the users to output the data and results of the analyses into zipped CSV files.

IntLIM::OutputData(inputData=inputDatafilt,filename=paste(result_dir, "FilteredData.zip", sep = "\\"))
IntLIM::OutputResults(inputResults=myres.gm.sig,filename=paste(result_dir, "MyResults.gm.csv", sep = "\\"))

##Run Cross-Validation You can also choose to run end-to-end cross-validation. This combines the steps of reading, filtering, running IntLIM, and processing the results for multiple folds. PlotFoldOverlapUpSet creates an UpSet plot of the significant pairs within each fold and common between folds.

crossValResults <- RunCrossValidation(inputData = inputData, analyteType1perc = 0.10, 
                   analyteType2perc = 0.10, analyteType2miss = 0.80,
                   stype="PBO_vs_Leukemia", covar = c("cellnbr"), outcome = c(1), 
                   independent.var.type = c(2), save.covar.pvals = TRUE,
                   pvalcutoff = 0.10, diffcorr = 0.5, rsquaredCutoff = 0.2,
                   folds = 4)
IntLIM::PlotFoldOverlapUpSet(crossValResults$processed)

##References

Siddiqui JK, Baskin E, Liu M, Cantemir-Stone CZ, Zhang B, Bonneville R, McElroy JP, Coombes KR, Mathé EA. IntLIM: integration using linear models of metabolomics and gene expression data. BMC bioinformatics. 2018 Dec;19(1):81.

Sievert, C., et al. plotly: Create Interactive Web Graphics via’plotly. js’. 2016. R package version 3.6. 0. In.

Wickham, H. and Chang, W. devtools: Tools to make developing R code easier. R package version 2015;1(0).